AITopics | shannon entropy

Constant Regret, Generalized Mixability, and Mirror Descent

Neural Information Processing SystemsMar-16-2026, 23:32:27 GMT

We consider the setting of prediction with expert advice; a learner makes predictions by aggregating those of a group of experts. Under this setting, and for the right choice of loss function and ``mixing'' algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds. For example, a constant regret can be achieved for \emph{mixable} losses using the \emph{aggregating algorithm}. The \emph{Generalized Aggregating Algorithm} (GAA) is a name for a family of algorithms parameterized by convex functions on simplices (entropies), which reduce to the aggregating algorithm when using the \emph{Shannon entropy} $\operatorname{S}$. For a given entropy $\Phi$, losses for which a constant regret is possible using the \textsc{GAA} are called $\Phi$-mixable. Which losses are $\Phi$-mixable was previously left as an open question. We fully characterize $\Phi$-mixability and answer other open questions posed by \cite{Reid2015}. We show that the Shannon entropy $\operatorname{S}$ is fundamental in nature when it comes to mixability; any $\Phi$-mixable loss is necessarily $\operatorname{S}$-mixable, and the lowest worst-case regret of the \textsc{GAA} is achieved using the Shannon entropy. Finally, by leveraging the connection between the \emph{mirror descent algorithm} and the update step of the GAA, we suggest a new \emph{adaptive} generalized aggregating algorithm and analyze its performance in terms of the regret bound.

artificial intelligence, machine learning, proceedings, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

Add feedback

d62e65cfdba247e0cd7cac5964f9fbd9-Supplemental-Conference.pdf

Neural Information Processing SystemsFeb-17-2026, 08:35:30 GMT

artificial intelligence, keystep, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Instructional Material > Course Syllabus & Notes (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

62e5721247075dd097023d077d8e22f7-Supplemental-Conference.pdf

Neural Information Processing SystemsFeb-9-2026, 10:15:09 GMT

algorithm, complexity, entropy, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Maryland > Baltimore (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
(2 more...)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

AnInformation

Neural Information Processing SystemsFeb-8-2026, 11:54:35 GMT

We study the Bayesian regret of the renowned Thompson Sampling algorithm incontextual bandits with binary losses and adversarially-selected contexts.

artificial intelligence, information ratio, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

3d84d9b523e6e82916d496e58761002e-Paper-Conference.pdf

Neural Information Processing SystemsFeb-8-2026, 11:54:32 GMT

bandit, contextual bandit, information ratio, (14 more...)

Neural Information Processing Systems

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Data Science (0.69)

Add feedback

Generalizing Bayesian Optimization with Decision-theoretic Entropies

Neural Information Processing SystemsDec-24-2025, 16:02:02 GMT

Bayesian optimization (BO) is a popular method for efficiently inferring optima of an expensive black-box function via a sequence of queries. Existing information-theoretic BO procedures aim to make queries that most reduce the uncertainty about optima, where the uncertainty is captured by Shannon entropy. However, an optimal measure of uncertainty would, ideally, factor in how we intend to use the inferred quantity in some downstream procedure. In this paper, we instead consider a generalization of Shannon entropy from work in statistical decision theory (DeGroot 1962, Rao 1984), which contains a broad class of uncertainty measures parameterized by a problem-specific loss function corresponding to a downstream task. We first show that special cases of this entropy lead to popular acquisition functions used in BO procedures such as knowledge gradient, expected improvement, and entropy search. We then show how alternative choices for the loss yield a flexible family of acquisition functions that can be customized for use in novel optimization settings. Additionally, we develop gradient-based methods to efficiently optimize our proposed family of acquisition functions, and demonstrate strong empirical performance on a diverse set of sequential decision making tasks, including variants of top-$k$ optimization, multi-level set estimation, and sequence search.

decision-theoretic entropy, generalizing bayesian optimization, name change, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.77)

Add feedback

Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Chung, Woojin, Kim, Jeonghoon

arXiv.org Artificial IntelligenceDec-1-2025

Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where the benefit comes from. To this end, we perform a controlled study that scales the vocabulary of the language model from 24K to 196K while holding data, computation, and optimization unchanged. We begin by quantifying the complexity of tokenized text -- formalized via Kolmogorov complexity -- and show that larger vocabularies reduce this complexity. Above 24K, every common word is already tokenized as a single token, so enlarging vocabulary only deepens the relative token-frequency imbalance. Word-level loss decomposition shows that larger vocabularies reduce cross-entropy loss almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. The same frequent words cover roughly 75% of tokens in downstream benchmarks, so this training advantage transfers intact. We further show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results recast "bigger vocabularies help" as "lowering complexity of tokenized text helps," offering a simple, principled knob for tokenizer-model co-design and clarifying the loss dynamics that govern language model scaling in pre-training.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2508.1539

Country:

Europe (1.00)
North America > United States (0.93)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Constant Regret, Generalized Mixability, and Mirror Descent

Neural Information Processing SystemsNov-20-2025, 22:48:36 GMT

We consider the setting of prediction with expert advice; a learner makes predictions by aggregating those of a group of experts. Under this setting, and for the right choice of loss function and ``mixing'' algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds. For example, a constant regret can be achieved for \emph{mixable} losses using the \emph{aggregating algorithm}. The \emph{Generalized Aggregating Algorithm} (GAA) is a name for a family of algorithms parameterized by convex functions on simplices (entropies), which reduce to the aggregating algorithm when using the \emph{Shannon entropy} $\operatorname{S}$. For a given entropy $\Phi$, losses for which a constant regret is possible using the \textsc{GAA} are called $\Phi$-mixable. Which losses are $\Phi$-mixable was previously left as an open question. We fully characterize $\Phi$-mixability and answer other open questions posed by \cite{Reid2015}. We show that the Shannon entropy $\operatorname{S}$ is fundamental in nature when it comes to mixability; any $\Phi$-mixable loss is necessarily $\operatorname{S}$-mixable, and the lowest worst-case regret of the \textsc{GAA} is achieved using the Shannon entropy. Finally, by leveraging the connection between the \emph{mirror descent algorithm} and the update step of the GAA, we suggest a new \emph{adaptive} generalized aggregating algorithm and analyze its performance in terms of the regret bound.

constant regret, entropy, generalized mixability, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

Add feedback

Constant Regret, Generalized Mixability, and Mirror Descent

Zakaria Mhammedi, Robert C. Williamson

Neural Information Processing SystemsNov-20-2025, 19:17:46 GMT

Under this setting, and for the right choice of loss function and "mixing" algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds.

artificial intelligence, loss null, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.96)

Add feedback

WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking

Park, Shinwoo, Park, Hyejin, Ahn, Hyeseon, Han, Yo-Sub

arXiv.org Artificial IntelligenceNov-14-2025

Large language models now draft news, legal analyses, and software code with human-level fluency. At the same time, regulations such as the EU AI Act mandate that each synthetic passage carry an imperceptible, machine-verifiable mark for provenance. Conventional logit-based watermarks satisfy this requirement by selecting a pseudorandom green vocabulary at every decoding step and boosting its logits, yet the random split can exclude the highest-probability token and thus erode fluency. WaterMod mitigates this limitation through a probability-aware modular rule. The vocabulary is first sorted in descending model probability; the resulting ranks are then partitioned by the residue rank mod k, which distributes adjacent-and therefore semantically similar-tokens across different classes. A fixed bias of small magnitude is applied to one selected class. In the zero-bit setting (k=2), an entropy-adaptive gate selects either the even or the odd parity as the green list. Because the top two ranks fall into different parities, this choice embeds a detectable signal while guaranteeing that at least one high-probability token remains available for sampling. In the multi-bit regime (k>2), the current payload digit d selects the color class whose ranks satisfy rank mod k = d. Biasing the logits of that class embeds exactly one base-k digit per decoding step, thereby enabling fine-grained provenance tracing. The same modular arithmetic therefore supports both binary attribution and rich payloads. Experimental results demonstrate that WaterMod consistently attains strong watermark detection performance while maintaining generation quality in both zero-bit and multi-bit settings. This robustness holds across a range of tasks, including natural language generation, mathematical reasoning, and code synthesis. Our code and data are available at https://github.com/Shinwoo-Park/WaterMod.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2511.07863

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology: